A Appendix
In Section 3.1, we empirically analyze the differences between SINs of patched and benign images. The benign image set consists of 10,000 images randomly selected from the ImageNet validation set. We analyze the distribution of SINs in terms of both the standard-deviation distance and the cluster number. Compared to Equation (3), an additional penalty loss on the activation values is considered. The insights of ScaleCert are shown in Figure 5. Figure 5(a) illustrates the superficial neurons. The deep-feature importance, by contrast, is obtained by weighting the activations in the last convolutional layer to indicate how important each deep feature is for a specific class. Therefore, the discrimination between deep features of benign and adversarial images is not as intuitive as that of superficial features (as shown in Figure 5(a, b, c)).
- North America > United States > Texas > Kleberg County (0.05)
- North America > United States > Texas > Chambers County (0.05)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.67)
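The appendix text above summarizes SINs of patched and benign images via two statistics, the standard-deviation distance and the cluster number. As a rough illustration only, the sketch below computes analogous quantities for a single image, assuming SINs are already given as a boolean mask over a shallow-layer activation map; the function name `sin_statistics` and the SIN selection rule in the usage example are hypothetical, and the exact ScaleCert definitions may differ.

```python
import numpy as np
from scipy import ndimage

def sin_statistics(activations: np.ndarray, sin_mask: np.ndarray):
    """Summarize superficial important neurons (SINs) of one image.

    activations: 2-D array of shallow-layer activation magnitudes (H x W).
    sin_mask:    boolean array of the same shape marking SIN locations.
    Returns (std_distance, cluster_count), loosely mirroring the two
    statistics discussed above; the paper's exact definitions may differ.
    """
    mu, sigma = activations.mean(), activations.std() + 1e-8
    # How far the SIN activations sit from the overall mean, in units of std dev.
    std_distance = float(np.mean((activations[sin_mask] - mu) / sigma))
    # Count connected components of SIN locations (8-connectivity).
    _, cluster_count = ndimage.label(sin_mask, structure=np.ones((3, 3)))
    return std_distance, int(cluster_count)

# Toy usage: patched images are expected to yield larger, more concentrated SINs.
acts = np.random.rand(14, 14)
mask = acts > np.quantile(acts, 0.9)   # hypothetical SIN selection rule
print(sin_statistics(acts, mask))
```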
On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling
Wu, Stanley, Bhaskar, Ronik, Ha, Anna Yoo Jeong, Shan, Shawn, Zheng, Haitao, Zhao, Ben Y.
Today's text-to-image generative models are trained on millions of images sourced from the Internet, each paired with a detailed caption produced by Vision-Language Models (VLMs). This part of the training pipeline is critical for supplying the models with large volumes of high-quality image-caption pairs during training. However, recent work suggests that VLMs are vulnerable to stealthy adversarial attacks, where adversarial perturbations are added to images to mislead the VLMs into producing incorrect captions. In this paper, we explore the feasibility of adversarial mislabeling attacks on VLMs as a mechanism for poisoning the training pipelines of text-to-image models. Our experiments demonstrate that VLMs are highly vulnerable to adversarial perturbations, allowing attackers to produce benign-looking images that are consistently miscaptioned by the VLMs. This has the effect of injecting strong "dirty-label" poison samples into the training pipeline for text-to-image models, successfully altering their behavior with a small number of poisoned samples. We find that while potential defenses can be effective, they can be targeted and circumvented by adaptive attackers. This suggests a cat-and-mouse game that is likely to reduce the quality of training data and increase the cost of text-to-image model development. Finally, we demonstrate the real-world effectiveness of these attacks, achieving high attack success (over 73%) even in black-box scenarios against commercial VLMs (Google Vertex AI and Microsoft Azure).
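The attack described above hinges on making a benign-looking image yield an incorrect caption. As a hedged illustration of the general idea, not the paper's actual method, the sketch below runs a PGD-style optimization against a hypothetical differentiable surrogate `image_encoder`, pushing the image embedding toward the embedding of an attacker-chosen wrong caption while staying inside an L-infinity budget; `image_encoder` and `target_text_emb` are assumptions of this sketch, and the black-box transfer to commercial VLMs is more involved.

```python
import torch
import torch.nn.functional as F

def adversarial_mislabel(image, target_text_emb, image_encoder,
                         eps=8 / 255, alpha=1 / 255, steps=200):
    """PGD-style sketch of an adversarial mislabeling attack.

    image:           tensor in [0, 1], e.g. shape (3, H, W).
    target_text_emb: embedding of the attacker-chosen incorrect caption.
    image_encoder:   assumed differentiable surrogate of the VLM's image tower.
    Keeps the perturbation within an L-infinity ball of radius eps so the
    image still looks benign while its embedding matches the wrong caption.
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        emb = image_encoder(torch.clamp(image + delta, 0, 1))
        # Minimize (1 - cosine similarity) to the wrong-caption embedding.
        loss = 1 - F.cosine_similarity(emb, target_text_emb, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # signed gradient descent step
            delta.clamp_(-eps, eps)              # project back into the L-inf ball
        delta.grad.zero_()
    return torch.clamp(image + delta, 0, 1).detach()
```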
Object-oriented backdoor attack against image captioning
Li, Meiling, Zhong, Nan, Zhang, Xinpeng, Qian, Zhenxing, Li, Sheng
Backdoor attacks against image classification have been widely studied and proven successful, while there is little research on backdoor attacks against vision-language models. In this paper, we explore backdoor attacks on image captioning models by poisoning training data. We assume the attacker has full access to the training dataset but cannot intervene in model construction or the training process. Specifically, a portion of benign training samples is randomly selected to be poisoned. Afterwards, considering that captions usually unfold around the objects in an image, we design an object-oriented method to craft poisons, which modifies pixel values within a small range, with the number of modified pixels proportional to the scale of the detected object region. After training with the poisoned data, the attacked model behaves normally on benign images, but for poisoned images, the model generates sentences irrelevant to the given image. The attack controls the model's behavior on specific test images without sacrificing generation performance on benign test images. Our method demonstrates the vulnerability of image captioning models to backdoor attacks, and we hope this work raises awareness of the need to defend against backdoor attacks in the image captioning field.
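To make the poisoning step above concrete, here is a hedged sketch of how a fraction of (image, caption) pairs might be selected and perturbed, with the number of modified pixels scaling with detected object size and the caption replaced by an irrelevant sentence. The helper `detect_objects`, the scaling factor, and the target caption are illustrative assumptions, not the paper's exact recipe.

```python
import random
import numpy as np

def poison_caption_dataset(dataset, detect_objects, poison_rate=0.05,
                           perturb_range=4,
                           target_caption="a photo of nothing in particular"):
    """Hedged sketch of object-oriented caption-poisoning.

    dataset:        list of (image, caption) pairs, images as uint8 arrays.
    detect_objects: assumed helper returning (x0, y0, x1, y1) boxes for an image.
    For each poisoned sample, a number of pixels proportional to each object's
    area is shifted by at most `perturb_range` intensity levels, and the
    caption is replaced with an irrelevant sentence.
    """
    poisoned = []
    for image, caption in dataset:
        if random.random() >= poison_rate:
            poisoned.append((image, caption))
            continue
        img = image.astype(np.int16).copy()
        for (x0, y0, x1, y1) in detect_objects(image):
            area = (x1 - x0) * (y1 - y0)
            n_pixels = area // 10      # modification count scales with region size
            if n_pixels <= 0:
                continue
            ys = np.random.randint(y0, y1, size=n_pixels)
            xs = np.random.randint(x0, x1, size=n_pixels)
            img[ys, xs] += np.random.randint(
                -perturb_range, perturb_range + 1,
                size=(n_pixels,) + img.shape[2:])
        poisoned.append((np.clip(img, 0, 255).astype(np.uint8), target_caption))
    return poisoned
```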
DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models
Zhou, Jiachen, Lv, Peizhuo, Lan, Yibing, Meng, Guozhu, Chen, Kai, Ma, Hualong
Dataset sanitization is a widely adopted proactive defense against poisoning-based backdoor attacks, aimed at filtering out and removing poisoned samples from training datasets. However, existing methods have shown limited efficacy in countering ever-evolving trigger functions and often lead to considerable degradation of benign accuracy. In this paper, we propose DataElixir, a novel sanitization approach tailored to purifying poisoned datasets. We leverage diffusion models to eliminate trigger features and restore benign features, thereby turning poisoned samples into benign ones. Specifically, over multiple iterations of the forward and reverse diffusion processes, we extract intermediary images and their predicted labels for each sample in the original dataset. Then, we identify anomalous samples based on the presence of label transitions among the intermediary images, detect the target label by quantifying distribution discrepancy, select purified images considering pixel and feature distance, and determine ground-truth labels by training a benign model. Experiments conducted on 9 popular attacks demonstrate that DataElixir effectively mitigates various complex attacks while exerting minimal impact on benign accuracy, surpassing the performance of baseline defense methods.
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Asia > China > Beijing > Beijing (0.04)
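The per-sample purification loop described in the DataElixir abstract can be sketched roughly as below, assuming a helper `diffuse_denoise` that applies one round of the diffusion forward (noising) and reverse (denoising) processes, and a classifier trained on the possibly poisoned data. Target-label detection and the final benign relabeling model are omitted, so this is only a partial, hedged illustration.

```python
import torch

@torch.no_grad()
def purify_sample(x, y, diffuse_denoise, classifier, iters=5):
    """Hedged sketch of one DataElixir-style purification pass.

    x:  image tensor; y: its (possibly poisoned) integer label.
    diffuse_denoise: assumed helper running forward noising + reverse denoising.
    classifier:      model trained on the original (possibly poisoned) dataset.
    Records intermediary images and their predicted labels, flags the sample
    as suspicious if the prediction ever drifts from y (a label transition),
    and returns the intermediary image closest to x in pixel distance.
    """
    intermediaries, preds = [], []
    cur = x
    for _ in range(iters):
        cur = diffuse_denoise(cur)                      # one forward/reverse round
        intermediaries.append(cur)
        preds.append(classifier(cur.unsqueeze(0)).argmax(dim=-1).item())
    suspicious = any(p != y for p in preds)             # label transition observed?
    dists = [torch.norm(z - x).item() for z in intermediaries]
    purified = intermediaries[dists.index(min(dists))]  # closest intermediary image
    return purified, suspicious
```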
Censored Sampling of Diffusion Models Using 3 Minutes of Human Feedback
Yoon, TaeHo, Myoung, Kibeom, Lee, Keon, Cho, Jaewoong, No, Albert, Ryu, Ernest K.
Diffusion models have recently shown remarkable success in high-quality image generation. Sometimes, however, a pre-trained diffusion model exhibits partial misalignment in the sense that the model can generate good images, but it sometimes outputs undesirable images. If so, we simply need to prevent the generation of the bad images, and we call this task censoring. In this work, we present censored generation with a pre-trained diffusion model using a reward model trained on minimal human feedback. We show that censoring can be accomplished with extreme human feedback efficiency and that labels generated with a mere few minutes of human feedback are sufficient.
- Asia > South Korea > Seoul > Seoul (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
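As a hedged sketch of how reward feedback might steer sampling in the censoring setup described above, the following DDPM step shifts the posterior mean along the gradient of a log-reward, in the style of classifier guidance. The paper's actual guidance scheme may differ; `eps_model`, `reward_model` (outputting the probability that an image is acceptable, trained on the small set of human labels), and the noise-schedule tensors `alphas` and `alphas_bar` are assumptions of this sketch.

```python
import torch

def censored_ddpm_step(x_t, t, eps_model, reward_model, alphas, alphas_bar,
                       guidance=1.0):
    """Hedged sketch of one reward-guided (censored) DDPM sampling step.

    x_t: current noisy sample; t: integer timestep.
    alphas, alphas_bar: 1-D tensors of the DDPM noise schedule.
    The standard posterior mean is shifted along the gradient of the
    log-reward, analogous to classifier guidance.
    """
    a_t, ab_t = alphas[t], alphas_bar[t]
    eps = eps_model(x_t, t)
    # Standard DDPM posterior mean: (x_t - (1-a_t)/sqrt(1-ab_t) * eps) / sqrt(a_t).
    mean = (x_t - (1 - a_t) / torch.sqrt(1 - ab_t) * eps) / torch.sqrt(a_t)

    # Gradient of the log-reward w.r.t. the current sample.
    x_in = x_t.detach().requires_grad_(True)
    log_r = torch.log(reward_model(x_in).clamp_min(1e-8)).sum()
    grad = torch.autograd.grad(log_r, x_in)[0]

    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    sigma = torch.sqrt(1 - a_t)                      # sqrt(beta_t) variance choice
    return mean + guidance * sigma ** 2 * grad + sigma * noise
```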